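2025 iThome Ironman Contest

DAY 24

AI & Data

雲端情人 - AI 愛 (Cloud Lover: AI Love) series, part 24

Keeping the Service Up, Refactored: Health Checks, Monitoring, and the Stability Trio (FastAPI × LINE SDK v3)

17th Ironman Contest

Max Cheng

2025-09-17 12:42:09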

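Day 23｜Keeping the Service Up: Health Checks, Monitoring, and the Stability Trio (FastAPI × LINE SDK v3)

Goal: stop relying on luck after launch. Today we add the following to the project:
1. Health checks (liveness / readiness)
2. Structured logging with Request ID correlation
3. The stability trio: timeouts, retries, and circuit breaking / throttling
4. (Optional) Prometheus metrics, to see traffic and errors at a glance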

⸻

Health checks: /healthz and /readyz

Liveness tells the platform "I am still alive"; readiness tells it "I can take traffic now."
Both work the same way on Render, K8s, or a plain cloud VM.

health.py

import os

import httpx
from fastapi import APIRouter
from fastapi.responses import PlainTextResponse

router = APIRouter()


@router.get("/healthz")
async def healthz():
    # Lightweight check: if the process can answer, it is alive
    return PlainTextResponse("ok", status_code=200)


@router.get("/readyz")
async def readyz():
    # Critical dependencies: env vars and external reachability (keep it fast, with a timeout)
    required_envs = ["CHANNEL_ACCESS_TOKEN", "CHANNEL_SECRET"]
    missing = [k for k in required_envs if not os.getenv(k)]
    if missing:
        return PlainTextResponse(f"env missing: {','.join(missing)}", status_code=503)

    try:
        # Quick probe against the LINE API host (or any lightweight probe you choose)
        async with httpx.AsyncClient(timeout=2) as c:
            # No need to call an authenticated API; a public endpoint or a DNS lookup is enough.
            # There is no raise_for_status() here: any HTTP response proves the host is reachable.
            await c.get("https://api.line.me/")
    except Exception as e:
        return PlainTextResponse(f"dep: line_api {e}", status_code=503)

    return PlainTextResponse("ready", status_code=200)

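In app_fastapi.py:

from health import router as health_router
app.include_router(health_router)

Benefits
• The platform can automatically restart a hung instance (when healthz fails).
• During a deploy, an instance only receives traffic once readiness is OK, so events are not dropped during cold start.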
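
To illustrate the second point, here is a minimal sketch of the gating loop a deploy script might run: poll a readiness probe until it reports ready or a deadline passes. `wait_until_ready` and its parameters are hypothetical names, and the probe is passed in as a plain callable (in practice it would issue a GET to /readyz and check for 200) so the sketch runs without a server.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    timeout: float = 30.0,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `probe` until it returns True or `timeout` seconds elapse.

    Hypothetical helper: `probe` stands in for "GET /readyz returned 200".
    """
    deadline = clock() + timeout
    while True:
        try:
            if probe():
                return True
        except Exception:
            # A failing probe (connection refused, timeout) just means "not ready yet"
            pass
        if clock() >= deadline:
            return False
        sleep(interval)


if __name__ == "__main__":
    # Simulated instance that becomes ready on the third probe
    answers = iter([False, False, True])
    print(wait_until_ready(lambda: next(answers), timeout=5, interval=0.01))
```

The `sleep` and `clock` arguments exist only so the loop can be exercised without real waiting; a real script would leave them at their defaults.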

⸻

Structured logging with Request ID

Turn logs into searchable JSON, and use a Request ID to tie together the whole call chain (webhook → your handler → external API).

2.1 Middleware that adds a Request ID

logging_setup.py

import json
import logging
import uuid
from contextvars import ContextVar
from typing import Callable

from starlette.middleware.base import BaseHTTPMiddleware
from starlette.requests import Request
from starlette.responses import Response

# Holds the current request's ID; "-" means "outside any request"
request_id_ctx: ContextVar[str] = ContextVar("request_id", default="-")


class JsonFormatter(logging.Formatter):
    def format(self, record):
        payload = {
            "level": record.levelname,
            "msg": record.getMessage(),
            "logger": record.name,
            "request_id": request_id_ctx.get(),
        }
        if record.exc_info:
            payload["exc_info"] = self.formatException(record.exc_info)
        return json.dumps(payload, ensure_ascii=False)


def setup_json_logging():
    h = logging.StreamHandler()
    h.setFormatter(JsonFormatter())
    root = logging.getLogger()
    root.handlers = []  # drop existing root handlers so each record is emitted once
    root.addHandler(h)
    root.setLevel(logging.INFO)


class RequestIdMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next: Callable):
        # Reuse the caller's ID if one was sent, otherwise generate a new one
        rid = request.headers.get("x-request-id") or str(uuid.uuid4())
        token = request_id_ctx.set(rid)
        try:
            resp: Response = await call_next(request)
            resp.headers["x-request-id"] = rid
            return resp
        finally:
            request_id_ctx.reset(token)

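Enable it:

app_fastapi.py

from logging_setup import setup_json_logging, RequestIdMiddleware
setup_json_logging()
app.add_middleware(RequestIdMiddleware)

Usage:

import logging
# Use an application logger that propagates to the root JSON handler.
# Under uvicorn's default logging config, the "uvicorn.*" loggers write through
# uvicorn's own handlers, so their output would not be JSON-formatted.
log = logging.getLogger("app")
log.info("handle text event")  # request_id is attached automatically

When debugging, search the log panel for a single request_id to see the whole flow.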
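
The reason a ContextVar works here is that each asyncio task gets its own copy of the context, so concurrent requests do not overwrite each other's ID. A minimal stdlib-only sketch of that behaviour follows; `handle`, `demo`, and the in-memory `ListHandler` are hypothetical names used only for this illustration, not part of the project.

```python
import asyncio
import json
import logging
from contextvars import ContextVar

request_id_ctx: ContextVar[str] = ContextVar("request_id", default="-")


class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({"msg": record.getMessage(), "request_id": request_id_ctx.get()})


class ListHandler(logging.Handler):
    """Hypothetical handler that keeps formatted lines in memory for inspection."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def emit(self, record):
        self.lines.append(self.format(record))


async def handle(rid: str, log: logging.Logger) -> None:
    request_id_ctx.set(rid)   # visible only inside this task's context
    await asyncio.sleep(0)    # yield so the two "requests" interleave
    log.info("handled %s", rid)


async def demo() -> list:
    handler = ListHandler()
    handler.setFormatter(JsonFormatter())
    log = logging.getLogger("ctx_demo")
    log.handlers = [handler]
    log.propagate = False
    log.setLevel(logging.INFO)
    # gather() wraps each coroutine in its own task, each with a copied context
    await asyncio.gather(handle("req-a", log), handle("req-b", log))
    return [json.loads(line) for line in handler.lines]


if __name__ == "__main__":
    for entry in asyncio.run(demo()):
        print(entry)
```

Each log line carries the ID of the task that emitted it, even though both tasks share one logger and one formatter.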

⸻

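The stability trio: timeout, retry, circuit breaking / throttling

3.1 A lightweight general-purpose httpx wrapper (timeout + retry)

http_helpers.py

import asyncio
import random

import httpx


async def fetch_json(url: str, headers=None, timeout=4.0, retries=2, backoff=(0.2, 0.8)):
    """GET a URL and return the parsed JSON, retrying up to `retries` times on any error."""
    last = None
    for i in range(retries + 1):
        try:
            async with httpx.AsyncClient(timeout=timeout) as c:
                r = await c.get(url, headers=headers)
                r.raise_for_status()
                return r.json()
        except Exception as e:
            last = e
            if i < retries:
                # Random pause within the backoff range before the next attempt
                await asyncio.sleep(random.uniform(*backoff))
    raise last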

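Usage (for example, an exchange-rate API):

from http_helpers import fetch_json


async def get_twd_per(target="JPY"):
    # The response uses `target` as its base, so rates["TWD"] is TWD per 1 unit of target
    data = await fetch_json(f"https://open.er-api.com/v6/latest/{target}")
    return data["rates"]["TWD"]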
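
fetch_json above sleeps for a uniformly random pause between attempts. A common alternative, not used in the original code, is exponential backoff with full jitter, where the upper bound of the pause doubles on each retry. The sketch below only computes the delays; `backoff_delays`, `base`, and `cap` are hypothetical names.

```python
import random
from typing import List, Optional


def backoff_delays(
    retries: int,
    base: float = 0.2,
    cap: float = 5.0,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return one sleep duration per retry using exponential backoff with full jitter.

    For retry i (starting at 0) the delay is drawn uniformly from
    [0, min(cap, base * 2**i)].
    """
    rng = rng or random.Random()
    delays = []
    for i in range(retries):
        upper = min(cap, base * (2 ** i))
        delays.append(rng.uniform(0.0, upper))
    return delays


if __name__ == "__main__":
    # Upper bounds for 5 retries: 0.2, 0.4, 0.8, 1.6, 3.2 seconds
    for i, d in enumerate(backoff_delays(5)):
        print(f"retry {i}: sleep {d:.3f}s")
```

With only two retries the difference from a fixed range is small; the growing bound matters more if the retry count is raised for a dependency with longer outages.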

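3.2 Circuit breaking & throttling (ultra-minimal version)

Stop hammering an external service that is already down, and stop a single user from flooding a group.

circuit.py

import time

FAIL_MAX = 5    # consecutive failures before the circuit opens
COOL_DOWN = 30  # seconds
_state = {"fails": 0, "until": 0}


def can_call() -> bool:
    # Calls are allowed again once the cool-down deadline has passed
    return time.time() >= _state["until"]


def record(success: bool):
    if success:
        _state["fails"] = 0
        _state["until"] = 0
    else:
        _state["fails"] += 1
        if _state["fails"] >= FAIL_MAX:
            _state["until"] = time.time() + COOL_DOWN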

At the call site:

from circuit import can_call, record

# Inside your handler function:
if not can_call():
    return "The external service is busy, please try again later 🙏"
try:
    # do external call...
    record(True)
except Exception:
    record(False)
    raise
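
The module-level _state above is shared by every external service. If more than one dependency needs protecting, one option is to wrap the same logic in a small class so each service gets its own counter. This is a sketch under that assumption; `CircuitBreaker` and its `clock` parameter are hypothetical names, and the injectable clock is there only so the behaviour can be checked without waiting 30 seconds.

```python
import time
from typing import Callable


class CircuitBreaker:
    """Same rules as circuit.py: open after `fail_max` consecutive failures,
    allow calls again once `cool_down` seconds have passed."""

    def __init__(self, fail_max: int = 5, cool_down: float = 30.0,
                 clock: Callable[[], float] = time.time):
        self.fail_max = fail_max
        self.cool_down = cool_down
        self._clock = clock
        self._fails = 0
        self._until = 0.0

    def can_call(self) -> bool:
        return self._clock() >= self._until

    def record(self, success: bool) -> None:
        if success:
            self._fails = 0
            self._until = 0.0
            return
        self._fails += 1
        if self._fails >= self.fail_max:
            self._until = self._clock() + self.cool_down


if __name__ == "__main__":
    now = [100.0]  # fake clock, advanced by hand
    cb = CircuitBreaker(clock=lambda: now[0])
    for _ in range(5):
        cb.record(False)
    print("right after 5 failures:", cb.can_call())
    now[0] += 31
    print("after the cool-down:", cb.can_call())
```

As in circuit.py, the failure count is not reset when the cool-down ends, so a single further failure opens the circuit again until a success is recorded.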

Per-user throttling (simple per-chat rate limit):

throttle.py

import time

WINDOW = 5   # seconds
MAX_REQ = 8  # tune as needed
_buckets = {}  # chat_id -> [ts1, ts2, ...]


def allow(chat_id: str) -> bool:
    now = time.time()
    q = _buckets.setdefault(chat_id, [])
    # Drop timestamps that have fallen out of the window
    while q and now - q[0] > WINDOW:
        q.pop(0)
    if len(q) >= MAX_REQ:
        return False
    q.append(now)
    return True

At the very top of the text-event handler:

from throttle import allow

if not allow(chat_id):
    await reply("Hold on a moment, let me catch my breath 😮‍💨")
    return
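
To get a feel for what WINDOW = 5 and MAX_REQ = 8 mean in practice, the sketch below replays a burst through the same sliding-window logic. It is a variant of allow() that takes the timestamp as an argument, an assumption made here so the run is deterministic; `SlidingWindow` and `allow_at` are hypothetical names.

```python
from collections import deque
from typing import Deque, Dict


class SlidingWindow:
    """Per-chat sliding-window limiter with explicit timestamps."""

    def __init__(self, window: float = 5.0, max_req: int = 8):
        self.window = window
        self.max_req = max_req
        self._buckets: Dict[str, Deque[float]] = {}

    def allow_at(self, chat_id: str, now: float) -> bool:
        q = self._buckets.setdefault(chat_id, deque())
        # Drop timestamps older than the window (same rule as throttle.py)
        while q and now - q[0] > self.window:
            q.popleft()
        if len(q) >= self.max_req:
            return False
        q.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindow()
    # 50 messages spread evenly over 10 seconds in one chat
    stamps = [i * 0.2 for i in range(50)]
    allowed = sum(limiter.allow_at("group-1", t) for t in stamps)
    print(f"allowed {allowed} of {len(stamps)} messages")
```

Rejected messages are not recorded, so a chat that keeps sending is let through again as soon as its oldest accepted message leaves the window.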

⸻

(Optional) Prometheus metrics

Very handy for coarse-grained monitoring: QPS, error rate, and the most-used features.

metrics.py

import time
from contextlib import contextmanager

from fastapi import APIRouter
from fastapi.responses import Response
from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST

router = APIRouter()

EVENTS_TOTAL = Counter("events_total", "Total LINE events", ["type"])
ERRORS_TOTAL = Counter("errors_total", "Total errors", ["where"])
LATENCY = Histogram("handler_latency_seconds", "Handler latency", ["name"])


@router.get("/metrics")
def metrics():
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)

Usage example (a timing helper, also in metrics.py)

@contextmanager
def observe(name: str):
    # Record how long the wrapped block took, even if it raises
    start = time.perf_counter()
    try:
        yield
    finally:
        LATENCY.labels(name=name).observe(time.perf_counter() - start)

In the event handler:

from metrics import router as metrics_router, EVENTS_TOTAL, ERRORS_TOTAL, observe
app.include_router(metrics_router)

EVENTS_TOTAL.labels(type="text").inc()
with observe("text_handler"):
    ...  # your handling logic
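
ERRORS_TOTAL is defined above, but the usage does not show where it gets incremented. One possible pattern is to count the exception inside the same context manager that measures latency. So that the sketch runs without prometheus_client installed, a plain collections.Counter and a list stand in for ERRORS_TOTAL and LATENCY; `observe_with_errors` is a hypothetical name.

```python
import time
from collections import Counter
from contextlib import contextmanager

# Stand-ins so the sketch runs without prometheus_client installed.
# In metrics.py these would be ERRORS_TOTAL.labels(where=...).inc()
# and LATENCY.labels(name=...).observe(...).
errors_total: Counter = Counter()
latencies = []  # list of (name, seconds)


@contextmanager
def observe_with_errors(name: str):
    """Time the wrapped block and count any exception under `name`, then re-raise."""
    start = time.perf_counter()
    try:
        yield
    except Exception:
        errors_total[name] += 1
        raise
    finally:
        latencies.append((name, time.perf_counter() - start))


if __name__ == "__main__":
    with observe_with_errors("text_handler"):
        pass  # a successful handler run
    try:
        with observe_with_errors("text_handler"):
            raise ValueError("simulated failure")
    except ValueError:
        pass
    print(dict(errors_total), len(latencies))
```

The exception is re-raised after counting, so a global exception handler still sees it.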

⸻

Exception handling: uniform responses, clearer logs

import logging

from fastapi import Request
from fastapi.responses import JSONResponse

# Application logger, so the record goes through the root JSON handler
log = logging.getLogger("app")


@app.exception_handler(Exception)
async def global_exc(request: Request, exc: Exception):
    # Log the full traceback, but return only a generic body to the client
    log.error("unhandled", exc_info=exc)
    return JSONResponse({"error": "internal error"}, status_code=500)

⸻

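Pre-deployment checklist
• /healthz returns 200 and /readyz returns 200 (a 1–2 second delay is tolerable during cold start)
• /metrics exposes events_total and errors_total
• Logs are JSON and include request_id (the value of the x-request-id header)
• Every external call has a timeout (≤ 5 s) and retries (≤ 2)
• Circuit breaking / throttling keeps the service from suicidally hammering an external dependency that is down
• A stress test in a group chat (50 text messages within 10 seconds) still gets normal replies or a graceful back-off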

⸻

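Debugging tips (from practice)
• Can't find a particular reply?
Take the request_id from the webhook log and search the application log with it; check whether the request was stuck on an external call (the latency histogram can confirm this).
• Translation mode suddenly stopped working?
Search the logs for translation_states state changes; most often the chat_id changed (a different group or room), or the error was swallowed by the global exception handler.
• Occasional 503s?
Check whether the /readyz probe is too strict (for example, calling a slow API on every probe); switching readiness to a lightweight check fixes it.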
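
For the last item, one way to keep /readyz light is to cache the outcome of the dependency check for a few seconds, so frequent probes do not each trigger an outbound call. A minimal sketch, assuming a synchronous check function; `CachedCheck`, `ttl`, and `clock` are hypothetical names and not part of the original project.

```python
import time
from typing import Callable, Optional


class CachedCheck:
    """Run `check` at most once per `ttl` seconds and reuse the last result."""

    def __init__(self, check: Callable[[], bool], ttl: float = 10.0,
                 clock: Callable[[], float] = time.monotonic):
        self._check = check
        self._ttl = ttl
        self._clock = clock
        self._last_result: Optional[bool] = None
        self._expires_at = 0.0

    def __call__(self) -> bool:
        now = self._clock()
        if self._last_result is None or now >= self._expires_at:
            try:
                self._last_result = bool(self._check())
            except Exception:
                # Treat any error in the check as "not ready"
                self._last_result = False
            self._expires_at = now + self._ttl
        return self._last_result


if __name__ == "__main__":
    calls = []

    def slow_dependency_check() -> bool:
        calls.append(1)  # count how often the real check runs
        return True

    ready = CachedCheck(slow_dependency_check, ttl=10.0)
    results = [ready() for _ in range(100)]
    print(all(results), "real checks run:", len(calls))
```

The trade-off is staleness: a dependency that goes down is reported as ready for up to `ttl` seconds, so the value should stay small relative to the platform's probe interval.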

⸻

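Summary

Today we filled in observability and resilience in one go:
• /healthz and /readyz make deployment and self-healing more reliable
• JSON logs with a Request ID make troubleshooting efficient
• Timeouts, retries, and circuit breaking / throttling hold up when external services are flaky
• Prometheus metrics make peaks, errors, and popular features easy to see

The project now has the "production-grade" fundamentals in place.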